produces a vector of size K such that sum(beta) = 0. The unconstrained representation requires only K - 1 values because the last is determined by the first K - 1.
A sum to zero vector is exactly what the name suggests. A vector where the sum of the elements equals 0. If you put a normal prior on the zero-sum vector the resulting variance will be less than the intended normal variance. To get the same variance as the intended normal prior do
where sigma is the intended standard deviation. FYI, it’s a bit more efficient to pre-calculate the sqrt(N * (N - 1.0)) in transformed_data. The general result to get a given variance from a normal with linear constraints is in: Fraser, D. A. S. (1951). Normal Samples With Linear Constraints and Given Variances. Canadian Journal of Mathematics, 3, 363–366. doi:10.4153/CJM-1951-041-9.
Prior to Stan 2.36, a sum-to-zero constraint could be implemented in one of two ways:
As a “hard” sum to zero constraint, where the parameter is declared to be an \(N-1\) length vector with a corresponding \(N\)-length transformed parameter whose first \(N-1\) elements are the same as the corresponding parameter vector, and the \(N^{th}\) element is the negative sum of the \(N-1\) elements.
As a “soft” sum to zero constraint with an \(N\)-length parameter vector whose sum is constrained to be within \(\epsilon\) of \(0\).
Up until now, users had to choose between the hard or soft sum-to-zero constraint, without clear guidance. As a general rule, for small vectors, the hard sum-to-zero constraint is more efficient; for larger vectors, the soft sum-to-zero constraint is faster, but much depends on the specifics of the model and the data.
For small \(N\) and models with sensible priors, the hard sum-to-zero is usually satisfactory. But as the size of the vector grows, it distorts the marginal variance of the \(N^{th}\). Given a parameter vector: \[
x_1, x_2, \dots, x_{N-1} \sim \text{i.i.d. } N(0, \sigma^2)
\] by the properties of independent normal variables, each of the free elements \(x_1, \ldots, x_{N-1}\) has variance \(\sigma^2\). However, the \(N^{th}\) element is defined deterministically as: \[
x_N = -\sum_{i=1}^{N-1} x_i
\] and its variance is inflated by a factor of \(N-1\). \[
\operatorname{Var}(x_N) = \operatorname{Var}\Bigl(-\sum_{i=1}^{N-1} x_i\Bigr)
= \sum_{i=1}^{N-1} \operatorname{Var}(x_i)
= (N-1)\sigma^2.
\] For large vectors, MCMC samplers struggle with the hard sum-to-zero constraint, as every change to any of the \(N-1\) elements also requires a corresponding change to the \(N^{th}\) element; balancing these changes introduces potential non-identifiabilities.
The soft sum-to-zero constraint is problematic for the following reasons.
The tolerance \(\epsilon\) (the scale of the penalty) must be chosen by the analyst. Too large, and the result is too far from zero to be effective, too small and the sampler cannot satisfy the constraint.
The soft constraint only penalizes deviations from zero, leading to weaker identifiability of the parameters. This can lead to slow convergence and mixing, as the sampler explores nearly non-identified regions.
The marginal variances may not reflect the intended prior.
The sum_to_zero_vector transform ensures that each element of the resulting constrained vector has the same variance. This improves the sampler performance, providing fast computation and good effective sample size. This becomes increasingly noticeable as models increase in size and complexity. To demonstrate this, in this notebook we consider two different classes of models:
Multi-level regressions for binomial data with group-level categorical predictors.
Spatial models for areal data.
For these models, we provide three implementations which differ only in the implementation of the sum-to-zero constraint: the built-in sum_to_zero_vector, and the hard and soft sum-to-zero implementations. We fit each model to the same dataset, using the same random seed, and then compare the summary statistics for the constrained parameter values. Since the models are equivalent, we expect that all three implementations should produce the same estimates; what differs is the speed of computation, as measured by effective samples per second.
Multi-level Models with Group-level Categorical Predictors
In this section we consider a model which estimates per-demographic disease prevalence rates for a population. The model is taken from the Gelman and Carpenter, 2020 Bayesian Analysis of Tests with Unknown Specificity and Sensitivity. It combines a model for multilevel regression and post-stratification with a likelihood that accounts for test sensitivity and specificity.
The data consists of:
A set of per-demographic aggregated outcomes of a diagnostic test procedure with unequal number of tests per demographic.
A corresponding set of demographic descriptors encoded as a vector of categorical values. In this example these are named sex, age, eth, and edu, but there can be any number of demographic predictors with any number of categories.
The specified test sensitivity and specificity
In order to fit this model, we need to put a sum-to-zero constraint on the categorical variables.
The Stan model
The full model is in file binomial_4_preds_ozs.stan. It provides an estimate of the true prevalence based on binary tests with a given (or unknown) test sensitivity and specificity as follows.
transformed parameters {// true prevalencevector[N] p = inv_logit(beta_0 + beta_sex * sex_c + beta_age[age] + beta_eth[eth] + beta_edu[edu]);// incorporate test sensitivity and specificity.vector[N] p_sample = p * sens + (1 - p) * (1 - spec);}model { pos_tests ~ binomial(tests, p_sample); // likelihood ...
To constrain the group-level parameters age, eth, and edu, we use the sum_to_zero_vector.
In order to put a standard normal prior on beta_age, beta_eth, and beta_edu, we need to scale the variance, as suggested above. The scaling factors are pre-computed in the transformed data block, and applied as part of the prior.
To investigate the predictive behavior of this model at different timepoints in a pandemic, we have written a data-generating program to create datasets given the baseline disease prevalence, test specificity and sensitivity, and the desired number of diagnostic tests. In the generated quantities block we use Stan’s PRNG functions to populate the true weights for the categorical coefficient vectors, and the relative percentages of per-category observations. Then we use a set of nested loops to generate the data for each demographic, using the PRNG equivalent of the model likelihood.
The full data-generating program is in file gen_binomial_4_preds.stan. The helper function simulate_data in file utils.py sets up the data-generating program according to the specified number of demographic categories, approximate average number of observations per category, and baseline disease prevalence, test specificity and sensitivity. This allows us to create datasets for large and small populations and for finer or more coarse-grained sets of categories. The larger the number of strata overall, the more observations are needed to get good coverage. Because the modeled data pos_tests is generated according to the Stan model’s likelihood, the model is a priori well-specified with respect to the data.
Note
Because the data-generating parameters and percentage of observations per category are generated at random, some datasets may have very low overall disease rates and/or many unobserved strata, and will therefore be pathologically hard to fit. Using a specified seed for the data-generating program avoids this problem.
The data generating program creates unequal numbers of tests per demographic. The following tables summarize the resulting distribution.
Show Code
print("Small Dataset")print("tests total "+str(np.sum(data_small['tests'])))small_df = pd.DataFrame({'tests per demographic': data_small['tests']}).describe()small_df.drop('count', inplace=True)small_df
Small Dataset
tests total 4451
tests per demographic
mean
16.49
std
18.51
min
0.00
25%
5.00
50%
10.00
75%
21.75
max
119.00
Show Code
print("Large Dataset")print("tests total "+str(np.sum(data_large['tests'])))large_df = pd.DataFrame({'tests per demographic': data_large['tests']}).describe()large_df.drop('count', inplace=True)large_df
Large Dataset
tests total 53863
tests per demographic
mean
199.49
std
218.00
min
4.00
25%
61.25
50%
123.00
75%
258.50
max
1401.00
Show Code
print("Tiny Dataset")print("tests total "+str(np.sum(data_tiny['tests'])))tiny_df = pd.DataFrame({'tests per demographic': data_tiny['tests']}).describe()tiny_df.drop('count', inplace=True)tiny_df
Tiny Dataset
tests total 947
tests per demographic
mean
3.51
std
4.38
min
0.00
25%
1.00
50%
2.00
75%
5.00
max
28.00
Model Fitting
We fit each model to the the large and small datasets. We record the seed used for the first run and use it for all subsequent fits; this insures that the models have the same set of initial parameter values for all parameters except the constrained params beta_eth, beta_edu, and beta_age.
If all three implementations are correct, we expect that they will produce the same estimates for all parameters. As always, we check the R-hat values and effective sample sizes. To do this, we tabulate the summary statistics from each of the three models, fit to the large and small datasets.
Model efficiency is measured by iterations per second, however, as the draws from the MCMC sampler may be correlated, we need to compute the number of effective samples across all chains divided by the total sampling time - this is ESS_bulk/s, the effective samples per second. The following table shows the average runtime for 100 runs of each of the three models on large and small datasets. The sum_to_zero_vector is far more efficient than either the hard or soft sum-to-zero constraint.
All models have R-hat values of 1.00 for all group-level parameters and high effective sample sizes.
The models recover the sign of the parameter, but not the exact value. In the case of lots of observations and only a few categories they do a better job of recovering the true parameters.
In almost all cases, estimates for each parameter are the same across implementations to 2 significant figures. In a few cases they are off by 0.01; where they are off, the percentage of observations for that parameter is correspondingly low.
This is as expected; all three implementations of the sum-to-zero constraint do the same thing; the sum_to_zero_vector implementation is the most efficient and therefore fastest.
Prior predictive checks
Prior predictive checks simulate data directly from the prior, in the absense of any observed data. Here we use the simulated dataset to examine the prior marginal variances of the elements of the sum-to-zero vector under the hard-sum-to-zero constraint and the built-in sum_to_zero transform. In particular, we are interested in the marginal variances of the elements of the sum-to-zero constrained parameters.
The above summaries show how the hard sum-to-zero constraint distorts the variance of the \(N^{th}\) element. This is only a problem for very sparse datasets, where the prior swamps the data. With data, this problem goes away; to see this we fit the hard sum-to-zero constraint model on the tiny and small datasets.
Marginal variances of the hard sum-to-zero constraint, tiny dataset
Comparison of these three models with different data regimes demonstrates the following.
For a multi-level model with group-level categorical predictors the sum_to_zero_vector provides fast results and good effective sample sizes for both datasets.
Model binomial_4preds_ozs.stan. shows how to properly scale the variance of a sum_to_zero_vector constrained parameter in order to put a standard normal prior on it.
Prior predictive checks demonstrate the difference between the marginal variances of the sum_to_zero_vector and hard sum-to-zero implementations.
Spatial Models with an ICAR component
Spatial auto-correlation is the tendency for adjacent areas to share similar characteristics. Conditional Auto-Regressive (CAR) and Intrinsic Conditional Auto-Regressive (ICAR) models, first introduced by Besag, 1974, account for this by pooling information from neighboring regions. The BYM model, (Besag, York, Mollié, 1991) extends a lognormal Poisson model plus ICAR component for spatial auto-correlation by adding an ordinary random-effects component for non-spatial heterogeneity. The BYM2 model builds on this model and subsequent refinements.
The ICAR, BYM2, and BYM2_multicomp models are more fully explained in a series of notebooks available from GitHub repo: https://github.com/mitzimorris/geomed_2024, see notebooks:
The key element of the BYM2 model is the ICAR component. Its conditional specification is a multivariate normal random vector \(\mathbf{\phi}\) where each \({\phi}_i\) is conditional on the values of its neighbors.
The joint specification rewrites to a Pairwise Difference,
Each \({({\phi}_i - {\phi}_j)}^2\) contributes a penalty term based on the distance between the values of neighboring regions. However, \(\phi\) is non-identifiable, constant added to \(\phi\) washes out of \({\phi}_i - {\phi}_j\). Therefore, a sum-to-zero constraint is needed to both identify and center \(\phi\).
The Stan implementation of the ICAR component computes the sum of the pairwise distances by representing the spatial adjacency matrix as a array of pairs of neighbor indices.
data { ...// spatial structureint<lower = 0> N_edges; // number of neighbor pairsarray[2, N_edges] int<lower = 1, upper = N> neighbors; // columnwise adjacent
The ICAR prior comes into the model as parameter phi.
The ICAR model requires that the neighbor graph is fully connected for two reasons:
The joint distribution is computed from the pairwise differences between a node and its neighbors; singleton nodes have no neighbors and are therefore undefined.
Even if the graph doesn’t have any singleton nodes, when the graph has multiple connected components a sum-to-zero constraint on the entire vector fails to properly identify the model.
Because the BYM2 model includes an ICAR component, it too requires a fully connected neighbor graph. In order to apply the BYM2 model to the full NYC dataset, it is necessary to extend the BYM2 model to account for disconnected components and singleton nodes.
First we use the BYM2 model for fully connected graphs to compare implementations of the sum-to-zero vector on phi, and then we extend the BYM2 model following the INLA implementation developed by Freni-Sterrantino et al. in 2018 and presented in A note on intrinsic Conditional Autoregressive models for disconnected graphs.
The dataset we’re using is that used in the analysis published in 2019 Bayesian Hierarchical Spatial Models: Implementing the Besag York Mollié Model in Stan. It data consists of motor vehicle collisions in New York City, as recorded by the NYC Department of Transportation, between the years 2005-2014, restricted to collisions involving school age children 5-18 years of age as pedestrians. Each crash was localized to the US Census tract in which it occurred, using boundaries from the 2010 United States Census, using the 2010 Census block map for New York City. New York City is comprised of several large and small islands, a chunk of the mainland (the Bronx), and the southern tip of Long Island Sound (Brooklyn and Queens). To analyze the full New York City dataset requires the extended BYM2 model, which we do in the second part of this section.
Model 1: The BYM2 model, Riebler et al. 2016
We compare three implementations of the BYM2 model which differ only in their implementation of phi, which is constrained to sum to zero.
BYM2 v1: sum_to_zero_vector
Parameter phi is declared as a sum_to_zero_vector. See file bym2_ozs.stan.
BYM2 v2: Hard sum-to-zero constraint
Parameter phi_raw is an vector of size N-1 and the N-length vector phi is computed in the transformed parameters block. See file bym2_hard.stan.
BYM2 v3: soft sum-to-zero constraint
Parameter phi is a vector of size N and its sum is soft-centered on \(0\). See file bym2_soft.stan.
Data assembly
The inputs to the BYM2 model are
The Poisson regression data
int<lower=0> N - number of regions
array[N] int<lower=0> y - per-region count outcome
vector<lower=0>[N] E - the population of each region (a.k.a. “exposure”),
int<lower=1> K - the number of predictors
matrix[N, K] xs - the design matrix
The spatial structure
int<lower = 0> N_edges - the number of neighbor pairs
real tau - the scaling factor, introduced in the BYM2
Note
The scaling factor tau was introduced by Riebler et al so that the variance of the spatial and ordinary random effects are both approximately equal to 1, thus allowing for a straightforward estimate of the amount of spatial and non-spatial variance. We have written a helper function called get_scaling_factor, in file utils_bym2.py which takes as its argument the neighbor graph and computes the geometric mean of the corresponding adjacency matrix.
To assemble the spatial data from the US Census data maps we use a set of utilities described more fully in notebooks 2 and 5 of the geomed_2024 Spatial Data Processing in Stan workshop. The neighbors graph of New York City consists of several large connected components, notably Brooklyn and Queens, the Bronx, Manhattan, and Staten Island, and a few other smaller components and singletons.
Show Code
from utils_nyc_map import nyc_sort_by_comp_sizenyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)from splot.libpysal import plot_spatial_weights print("New York City neighbors graph")plot_spatial_weights(nyc_nbs, nyc_gdf)
New York City neighbors graph
Because the BYM2 model requires that the neighbors graph be fully connected, for this analysis, we restrict this analysis to just Brooklyn and Queens, the largest subcomponent.
The BYM2 model requires many warmup iterations in order to reach convergence for all parameters, including hyperparameters rho and sigma. We run all three models using the same seed, in order to make the initial parameters as similar as possible.
If all three implementations are correct, we expect that they will produce the same estimates for all parameters. As always, we check the R-hat values and effective sample sizes. To do this, we tabulate the summary statistics from each of the three models for the hierarchical parameters beta, sigma, and rho.
The sum_to_zero vector is the most efficient. In contrast to the outcomes for the binomial model above, the soft sum-to-zero constraint is more efficient than the hard sum-to-zero constraint.
Model 2: The BYM2_multicomp model, Freni-Sterrantino et al, 2018
In order to apply the BYM2 model to the full NYC dataset, it is necessary to extend the BYM2 model to account for disconnected components and singleton nodes. (The alternative is to add edges to the neighbor graphs so that the neighbors graph is a single connected component. This is problematic, see Notebook 6 The BYM2_multicomp model in Stan for details.)
Singleton nodes (islands) are given a standard Normal prior
Compute per-connected component scaling factor
Impose a sum-to-zero constraint on each connected component
We have followed these recommendations and implemented this model in Stan. The full model is in file bym2_multicomp.stan
For this case study, we compare 2 implementations of the BYM2_multicomp model: one which uses the sum_to_zero_vector and one which implements the soft sum-to-zero constraint.
It is necessary to constrain the the elements of the spatial effects vector phi on a component-by-component basis. Stan’s slicing with range indexes, provides a way to efficiently access each component. The helper function nyc_sort_by_comp_size both sorts the study data by component and adds the component index to the geodataframe.
In the BYM2 model for a fully connected graph the sum-to-zero constraint on phi is implemented directly by declaring phi to be a sum_to_zero_vector, which is a constrained parameter type. The declaration:
sum_to_zero_vector[N] phi; // spatial effects
creates a constrained variable of length \(N\), with a corresponding unconstrained variable of length \(N-1\).
In order to constrain slices of the parameter vector phi, we do the following:
In the parameters block, we declare the unconstrained parameter phi_raw as an regular vector vector (instead of a sum_to_zero_vector).
For a fully connected graph of size \(N\), the size of the unconstrained sum-to-zero vector is \(N-1\). For a disconnected graph with \(M\) non-singleton nodes, the size of phi_raw is \(M\) minus the number of connected components.
The constraining transform is broken into two functions:
function zero_sum_constrain, the actual constraining transform, which corresponds directly to the built-in zero_sum_vector transform.
function zero_sum_constrain_components, which handles the slicing, and calls zero_sum_constrain on each component.
/** * Constrain sum-to-zero vector * * @param y unconstrained zero-sum parameters * @return vector z, the vector whose slices sum to zero */vector zero_sum_constrain(vector y) {int N = num_elements(y);vector[N + 1] z = zeros_vector(N + 1);real sum_w = 0;for (ii in1:N) {int i = N - ii + 1; real n = i;real w = y[i] * inv_sqrt(n * (n + 1)); sum_w += w; z[i] += sum_w; z[i + 1] -= w * n; }return z; }
zero_sum_components: slices vector phi by component, applies constraining transform to each.
/** * Component-wise constrain sum-to-zero vectors * * @param phi unconstrained vector of zero-sum slices * @param idxs component start and end indices * @param sizes component sizes * @return vector phi_ozs, the vector whose slices sum to zero */vector zero_sum_components(vector phi,array[ , ] int idxs,array[] int sizes) {vector[sum(sizes)] phi_ozs;int idx_phi = 1;int idx_ozs = 1;for (i in1:size(sizes)) { phi_ozs[idx_ozs : idx_ozs + sizes[i] - 1] = zero_sum_constrain(segment(phi, idx_phi, sizes[i] - 1)); idx_phi += sizes[i] - 1; idx_ozs += sizes[i]; }return phi_ozs; }
Note
The constraining transform is a linear operation, leading to a constant Jacobian determinant which is therefore not included. As of Stan 2.36, transforms which include a Jacobian adjustment can do so with the jacobian += statement and must have names ending in _jacobian. See section functions implementing change-of-variable adjustments in the Stan User’s Guide chapter on user-defined functions for details.
Data Assembly
The requisite data-preprocessing of the spatial data is handled by Python helper functions. The helper function nyc_soft_by_comp_size adds component info to the geodataframe. It returns the expanded geodataframe, the neighbor graph, and a list of component sizes.
Show Code
from utils_nyc_map import nyc_sort_by_comp_sizefrom utils_bym2 import nbs_to_adjlist, get_scaling_factorsnyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)nyc_nbs_adj = nbs_to_adjlist(nyc_nbs)component_sizes = [x for x in nyc_comp_sizes if x >1]scaling_factors = get_scaling_factors(len(component_sizes), nyc_gdf)
The NYC predictors are on very different scales. To correct this, we log transform two of the four columns of the design matrix.
We can compare the sum_to_zero_vector implementation to the corresponding soft sum-to-zero constraint. The full model is in file bym2_multicomp_soft.stan
In model bym2_soft.stan the soft sum-to-zero constraint is combined directly with the ICAR prior:
For the BYM2_multicomp model, this operation is carried out in two steps. First the ICAR prior is applied to phi, next we iterate through the components, applying the sum-to-zero constraint to each in turn.
The data inputs are the same. To ensure (roughly) the same initialization, we reuse the seed from bym2_multicomp_ozs_fit. This model fits very slowly and requires increasing the max_treedepth; at the default setting, all iterations hit this limit.
The BYM2 model has more data and a relatively complex multilevel structure. Before Stan 2.36, for this model and dataset, the soft sum-to-zero constraint was much faster than the hard sum-to-zero constraint. Here we show that the sum_to_zero_vector greatly improves the run time.
For the BYM2_multicomp model, model bym2_multicomp.stan shows how to implement the sum_to_zero_vector constraining transform as a Stan function. A comparable implementation using the soft sum-to-zero implementation is painfully slow. Both implementation get exactly the same estimates, which simply confirms that both models are correctly implemented. The dramatic difference in run times speaks for itself.
Conclusion: the sum_to_zero_vector just works!
The more complex the model, the greater the need for the sum_to_zero_vector. When considering the effective sample size, it is important to remember that what is most important is effective samples per second. In these experiments, the sum_to_zero_vector models consistently have the highest ESS/sec.
Key takeaways:
In the first part of this case study, we show how to use the sum_to_zero_vector for categorical predictors in a hierarchical model, and how to adjust the marginal variances for a normal prior.
In the second part of this case study we see how to implement the constraining transform used for the sum_to_zero_vector as a Stan function in order to handle the case where a vector is the concatenation of several sum-to-zero vectors.